Depicted above are Sales vs TV, Radio, and Newspaper advertising budgets, with a blue linear-regression line fit separately to each
Do you think that we could predict Sales using these three advertising budgets?
Perhaps we could use a model of the form
\(Sales \approx f(TV, Radio, Newspaper)\)
Where Sales is a response (dependent) variable that we want to predict
TV, Radio, and Newspaper are features (independent variables) that we name \(X_1, X_2, X_3\), respectively
We can refer to the input vector collectively as: \(X = \begin{pmatrix} X_1\\ X_2\\ X_3 \end{pmatrix}\)
This allows us to write the model as: \(Y = f(X) + \epsilon\)
Here \(f\) is some fixed but unknown function of \(X_1,...,X_p\) and \(\epsilon\) is an error term which is independent of \(X\) and has mean zero
In this formulation, \(f\) represents the systematic information that \(X\) provides about \(Y\)
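As a minimal sketch of this formulation (assuming NumPy is available; the linear \(f\) and all data below are hypothetical, chosen only for illustration), we can simulate observations from \(Y = f(X) + \epsilon\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical systematic function f -- unknown in a real problem
def f(x):
    return 2.0 + 3.0 * x

n = 10_000
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, 1.0, size=n)  # error term: mean zero, independent of X
y = f(x) + eps                      # Y = f(X) + epsilon
```

The sample mean of `eps` is close to zero, matching the assumption that the error term has mean zero.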
The two main reasons to estimate \(f\) are prediction and inference
Prediction
Inference
In many situations, a set of inputs \(X\) is readily available, but the output \(Y\) cannot be easily obtained
We can make predictions with the following: \(\hat{Y} = \hat{f}(X)\)
Where \(\hat{f}\) represents our estimate of \(f\) and \(\hat{Y}\) represents the resulting prediction for \(Y\)
Example:
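A small illustrative sketch, assuming NumPy and synthetic data with a hypothetical linear truth: we estimate \(\hat{f}\) by ordinary least squares and use it to produce \(\hat{Y}\) for a new input.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data under an assumed linear truth (illustration only)
x = rng.uniform(0, 10, size=200)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Estimate f_hat by ordinary least squares on the design matrix [1, x]
A = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Prediction: y_hat = f_hat(x_new)
x_new = 4.0
y_hat = beta_hat[0] + beta_hat[1] * x_new
```

For prediction purposes \(\hat{f}\) can be treated as a black box: we only need `y_hat` to be accurate, not the fitted coefficients themselves.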
The accuracy of \(\hat{Y}\) as a prediction for \(Y\) depends on the following quantities:
Reducible error, which comes from \(\hat{f}\) being an imperfect estimate of \(f\) and can be reduced by choosing a better estimate
Irreducible error, which comes from \(\epsilon\): since \(Y\) also depends on \(\epsilon\), which cannot be predicted from \(X\), this error remains no matter how well we estimate \(f\)
For a given \(\hat{f}\) and \(X\): \(E[(Y-\hat{Y})^2] = \underbrace{[f(X)-\hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{Var(\epsilon)}_{\text{Irreducible}}\)
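We can see the irreducible floor numerically (a sketch with NumPy and a hypothetical linear \(f\)): even when predictions use the true \(f\) itself, the MSE cannot fall below \(Var(\epsilon)\).

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return 2.0 + 3.0 * x  # hypothetical true f

n = 100_000
x = rng.uniform(0, 10, size=n)
y = f(x) + rng.normal(0, 1.0, size=n)  # Var(eps) = 1

# Even with the *true* f in hand, the error term remains:
mse_with_true_f = np.mean((y - f(x)) ** 2)  # approx Var(eps) = 1, the floor
```

Any estimate \(\hat{f}\) can only add reducible error on top of this floor.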
Our focus will be on techniques for estimating \(f\) that minimize the reducible error
There are times when we are interested in understanding the association (relationship) between \(Y\) and \(X_1,...,X_p\)
We will still estimate \(f\), but our goal is not necessarily to make predictions for \(Y\); instead, we need a better understanding of the exact form of \(\hat{f}\)
We may be interested in answering the following questions:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Can the relationship between \(Y\) and each predictor be adequately summarized using a linear regression or is the relationship more complicated?
Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating \(f\) may be appropriate
Linear models provide relatively simple and interpretable inference, but may not yield as accurate predictions as other approaches
Alternatively, other non-linear approaches can provide highly accurate predictions of \(Y\), but less interpretable models for inference
Parametric Methods
We make an assumption about the functional form, or shape, of \(f\)
The linear model is a parametric model, where we assume the functional form is linear
After a model has been selected, we will use training data to fit or train the model
Parametric models reduce the problem of estimating \(f\) down to estimating a set of parameters
A potential drawback of parametric models is that the functional form we choose will usually not match the true unknown form of \(f\)
We can try to address this by using more flexible models that can fit many different possible functional forms of \(f\)
Non-Parametric Methods
Non-parametric methods do not make explicit assumptions about the functional form of \(f\)
Instead they seek an estimate of \(f\) that gets as close to the data points as possible without being too rough or wiggly
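As one concrete non-parametric sketch (a simple Gaussian-kernel local average in the style of kernel smoothing; NumPy and the synthetic data are assumptions for illustration): the estimate at a point is a weighted average of nearby responses, with no global functional form imposed on \(f\).

```python
import numpy as np

def kernel_smooth(x_train, y_train, x0, bandwidth=0.3):
    """Non-parametric local average: weight each training response by a
    Gaussian kernel on its distance to x0; no global form for f is assumed."""
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 10, size=400)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=400)  # nonlinear truth

pred = kernel_smooth(x_train, y_train, x0=3.0)
```

The bandwidth controls how "rough or wiggly" the estimate is: a small bandwidth chases individual points, a large one smooths aggressively.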
Advantages
Disadvantages
Linear models are easy to interpret, while thin-plate splines are not
Good fit vs. over-fit or under-fit
Parsimony vs. black-box
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data
In the regression setting, the most commonly-used measure is the mean squared error (MSE)
\(MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2\)
Where \(\hat{f}(x_i)\) is the prediction that \(\hat{f}\) gives for the \(i\)th observation
The MSE will be small if the predicted responses are very close to the true responses and large if the predicted and true responses differ substantially
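A direct translation of the MSE formula (assuming NumPy; the numbers are a made-up worked example):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average of the squared residuals."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# Made-up worked example: residuals 0, -0.5, 1 -> mean of (0 + 0.25 + 1)
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```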
Suppose we fit a model \(\hat{f}(x)\) to some training data, \(Tr = \{x_i,y_i\}^n_1\) and wish to see how well it performs
We could compute the average squared prediction error over \(Tr\)
\(MSE_{Tr} = \mathrm{Ave}_{i \in Tr}[y_i-\hat{f}(x_i)]^2\)
This may be biased toward more overfit models
We are not overly concerned with how well our method works on the training data. Instead, we are interested in the accuracy of the predictions we obtain when we apply the method to previously unseen test data
Therefore, we can compute the MSE on the test data \(Te = \{x_i,y_i\}^m_1\)
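A sketch of why \(MSE_{Tr}\) is biased toward overfit models (NumPy, synthetic data, and a 1-nearest-neighbor fit chosen as an extreme example): a 1-NN fit reproduces the training set exactly, so its training MSE is zero, yet its test MSE is not.

```python
import numpy as np

rng = np.random.default_rng(3)

def gen(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)

x_tr, y_tr = gen(60)    # training set Tr
x_te, y_te = gen(1000)  # held-out test set Te

def knn_predict(x_query, k):
    # predict each query point as the mean response of its k nearest training points
    return np.array([y_tr[np.argsort(np.abs(x_tr - x0))[:k]].mean()
                     for x0 in x_query])

# 1-NN reproduces every training point exactly: training MSE is 0
mse_tr_1nn = np.mean((y_tr - knn_predict(x_tr, k=1)) ** 2)
mse_te_1nn = np.mean((y_te - knn_predict(x_te, k=1)) ** 2)
```

Judged by \(MSE_{Tr}\) alone, this model looks perfect; only the test MSE reveals the overfitting.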
Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases
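This tradeoff can be sketched by simulation (NumPy; the true \(f\), noise level, and polynomial degrees are all hypothetical choices): refit a rigid and a flexible model on many independent training sets and compare the bias and variance of their predictions at a fixed point.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(2 * np.pi * x)  # hypothetical true function

x0 = 0.25  # fixed point at which we study the estimators

def fit_and_predict(degree, n=30, sigma=0.3):
    # draw a fresh training set and return the fitted model's prediction at x0
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    coef = np.polyfit(x, y, degree)
    return np.polyval(coef, x0)

reps = 500
rigid = np.array([fit_and_predict(degree=1) for _ in range(reps)])  # inflexible
flex = np.array([fit_and_predict(degree=9) for _ in range(reps)])   # flexible

bias_rigid = abs(rigid.mean() - f(x0))  # large: a line cannot track the sine
bias_flex = abs(flex.mean() - f(x0))    # small: degree 9 tracks it well
var_rigid = rigid.var()                 # small: the line barely moves across sets
var_flex = flex.var()                   # larger: flexible fits chase the noise
```

The flexible fit trades higher variance for lower bias; which model wins in test MSE depends on which effect dominates.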